Clean data

Load libraries

Load data

Exploratory data analysis of raw data

We have two datasets. They describe proteins that are receptors or part of receptors. The metadata describes function, protein names and IDs, organism and sequence of the protein. The feature data contains information on length, molecular weight, secondary structure features and ratio of each amino acid.

[1] 63 28
[1] 63  8

We have information on 63 proteins and both datasets seem to contain information on the same number of proteins.

First we remove unwanted, empty columns from Metadata.

# A tibble: 6 × 6
  Protein_Name                   Protein_ID Sequence Organism Gene_Name Function
  <chr>                          <chr>      <chr>    <chr>    <chr>     <chr>   
1 Glutamate receptor ionotropic… A0A1L8F5J9 MGTMRLF… Xenopus  grin1     Compone…
2 5-hydroxytryptamine receptor … A5X5Y0     MEGSWFH… Homo     HTR3E     Forms s…
3 Ionotropic receptor 25a        E9NA96     MILMNPK… Drosoph… Ir25a     Integra…
4 Glycine receptor subunit alph… F1R8P4     MTRPSVK… Danio    glra2     Subunit…
5 Glutamate-gated chloride chan… G5EBR3     MATWIVG… Caenorh… glc-1     Glutama…
6 Betaine receptor acr-23_ERR    G5EG88     MHRIYTF… Caenorh… acr-23    Betaine…

Let’s take a look at what organisms our proteins come from.

We can see multiple species commonly used in molecular biology research, with the most proteins coming from Homo sapiens, some Rattus species, Caenorhabditis elegans and some Aspergillus species. However, we also notice that there is a species called Homo_ERR, which should probably be converted to Homo in the cleaning process.

Let’s find out if we have some recurring functions. For this, we will count the most frequent words used in the Function column (e.g. “receptor”, “serotonin”, etc.). This uses the SnowballC package, so don’t run unless the package is installed.

We see that the protein functions like “receptor”, “channel”, structural information such as “subunit” show up often as well as information on ion usage (“chlorid”, “cation”, “ionotropic”), or neurotransmitter (“glutamate”, “serotonin”).

We can also visualize the length of our protein sequences.

Merge datasets

Using bind_cols() would be enticing but is bad practice, because it blindly matches rows by their position. To avoid any potential problems arising from this, we create a .row_id key in each dataset and join them by that key.

# A tibble: 6 × 34
  Protein_Name            Protein_ID Sequence Organism Gene_Name Function length
  <chr>                   <chr>      <chr>    <chr>    <chr>     <chr>     <dbl>
1 Glutamate receptor ion… A0A1L8F5J9 MGTMRLF… Xenopus  grin1     Compone…    903
2 5-hydroxytryptamine re… A5X5Y0     MEGSWFH… Homo     HTR3E     Forms s…    456
3 Ionotropic receptor 25a E9NA96     MILMNPK… Drosoph… Ir25a     Integra…    947
4 Glycine receptor subun… F1R8P4     MTRPSVK… Danio    glra2     Subunit…    449
5 Glutamate-gated chlori… G5EBR3     MATWIVG… Caenorh… glc-1     Glutama…    461
6 Betaine receptor acr-2… G5EG88     MHRIYTF… Caenorh… acr-23    Betaine…    545
# ℹ 27 more variables: molecular_weight <dbl>, gravy <dbl>, aromaticity <dbl>,
#   instability_index <dbl>, helix_fraction <dbl>, turn_fraction <dbl>,
#   sheet_fraction <dbl>, aa_A <dbl>, aa_C <dbl>, aa_D <dbl>, aa_E <dbl>,
#   aa_F <dbl>, aa_G <dbl>, aa_H <dbl>, aa_I <dbl>, aa_K <dbl>, aa_L <dbl>,
#   aa_M <dbl>, aa_N <dbl>, aa_P <dbl>, aa_Q <dbl>, aa_R <dbl>, aa_S <dbl>,
#   aa_T <dbl>, aa_V <dbl>, aa_W <dbl>, aa_Y <dbl>

Clean data

Asses “dirtiness” of data

First we asses number of missing values.

Protein_Name   Protein_ID     Sequence     Organism    Gene_Name     Function 
           0            0            0            0            0            1 
[1] 1

From this we can tell that our dataset only contains one NA value.

However, we know that some of the string values have an ‘_ERR’ attached, to signify missing values. We should also check how many of those exist.

# A tibble: 1 × 6
  Protein_Name Protein_ID Sequence Organism Gene_Name Function
         <int>      <int>    <int>    <int>     <int>    <int>
1            4          1        1        1         4        3

So in all character columns there are 1-4 values that are marked as error values. To create this matrix, we used pivot_longer() to be able to create a column for each variable. From the missingness matrix there seem to be no features with particularly high missingness and missingness appears to be random, meaning there are no obvious batch effects or other confounders influencing missingness.

Next, we analyze outliers in our numeric data.

To make outliers of different variables comparable to one another, values have been log-normalized.

These last steps show all of the impurities that exist in our data, missing values for character data and numerical data, as well as outliers for numerical data. In the following steps we will show how to deal with them.

Remove dirtiness of data

First we will start with cleaning string data by removing the “_ERR” at the end of a string. We do this using the mutate()-function taught in class.

# A tibble: 1 × 6
  Protein_Name Protein_ID Sequence Organism Gene_Name Function
         <int>      <int>    <int>    <int>     <int>    <int>
1            0          0        0        0         0        0

Next, we will deal with the numeric outliers in our data. For all other numeric variables, we identify them using the IQR-method that boxplots use as well. Since the outliers are artificial, it gives no biological insight to keep them, so we will impute them using the median-method. We prefer the median-method over the mean-method here, because the latter is much more sensitive to large outliers, which is what we have in our data.

There is an important distinction between the numerical variables: the ones that start with “aa” reflect the ratio of a given amino acid in the protein sequence. So we can provide an even more precise method of dealing with outliers, that is more accurate than median imputation. We can simple recalculate the ratios from the sequence provided in our metadata.. Afterwards we can check if we did a good job by adding the ratios together and seeing whether they add up to 1.

# A tibble: 1 × 1
  aa_ratio_sum
         <dbl>
1            1

They do add up to one in every case, so it is likely that we managed to fix all the amino acid ratio values.

We can look at the boxplots again:

As is visible, the outliers have been successfully dealt with.

Lastly, we save out cleaned data to our data-folder.